
Fixes #27950: [Datalake] JSON columns incorrectly typed as STRING for empty dict values #27951

Merged
ulixius9 merged 6 commits into main from fix/datalake-json-column-type-detection on May 11, 2026

Conversation

@mohittilala (Contributor) commented May 7, 2026

Describe your changes:

Fixes #27950

Changes in OpenMetadata submodule (datalake_utils.py):

  • Empty dict/list columns now correctly typed as JSON/ARRAY instead of STRING
  • Skip ast.literal_eval round-trip for already-parsed dict/list values
  • get_children handles parsed dicts and JSON strings independently — no more TypeError log spam
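
A minimal sketch of the first two bullets, for illustration only. It is not the code in datalake_utils.py: `infer_value_type` and `observed_types` are hypothetical helpers and the type names are plain strings. It assumes the original problem was a truthiness check that discarded `{}` and `[]`, and shows already-parsed containers being classified without going through `ast.literal_eval`.

```python
import ast


def infer_value_type(value):
    """Return an illustrative type name for one cell value (hypothetical helper)."""
    # Already-parsed containers are classified directly,
    # with no str() -> ast.literal_eval round-trip.
    if isinstance(value, dict):
        return "JSON"
    if isinstance(value, list):
        return "ARRAY"
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return "STRING"
        if isinstance(parsed, dict):
            return "JSON"
        if isinstance(parsed, list):
            return "ARRAY"
    # Everything else is treated as STRING in this sketch.
    return "STRING"


def observed_types(values):
    """Collect the type names seen in a column, ignoring only real nulls."""
    # Filter on `is not None`, not truthiness: `if v` would drop {} and [].
    return {infer_value_type(v) for v in values if v is not None}
```

With this filter, a column such as `[None, {}, {}]` yields only the JSON type, because the empty dicts are still counted as candidates.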

Tests added (tests/unit/utils/test_datalake.py):

  • Unit tests for fetch_col_types and get_children with parsed objects, empty containers, mixed types
  • End-to-end tests reading real fixture files through the full _read_json_object → _get_columns pipeline

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Bug fix

  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

Summary by Gitar

  • Type detection logic:
    • Replaced lexicographic max() type resolution with _TYPE_PRECEDENCE mapping in fetch_col_types.
    • Prevents structured types (dict, list) from being incorrectly downgraded to STRING in mixed-type columns.
  • Testing additions:
    • Added TestFetchColTypesMixedTypes to verify correct resolution for mixed dict/str, list/str, and numeric column types.
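
To illustrate the summary above, here is a small sketch of why a lexicographic `max()` picks STRING and how an explicit ranking avoids that. It assumes the compared values are type-name strings. The `TYPE_PRECEDENCE` table and both function names are hypothetical; the actual contents of `_TYPE_PRECEDENCE` are not shown on this page, and the relative order of ARRAY and JSON here is an assumption.

```python
# Hypothetical ranking: higher number wins. Not the real _TYPE_PRECEDENCE.
TYPE_PRECEDENCE = {"STRING": 0, "ARRAY": 1, "JSON": 2}


def resolve_lexicographic(type_names):
    # max() on strings compares alphabetically, so "STRING" beats
    # both "JSON" and "ARRAY" simply because "S" sorts last.
    return max(type_names)


def resolve_by_precedence(type_names):
    # Rank each name explicitly; unknown names get the lowest rank.
    return max(type_names, key=lambda name: TYPE_PRECEDENCE.get(name, -1))


mixed = ["JSON", "STRING"]
print(resolve_lexicographic(mixed))   # STRING
print(resolve_by_precedence(mixed))   # JSON
```

An explicit mapping keeps the outcome independent of how the type names happen to be spelled.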

This will update automatically on new commits.

@mohittilala mohittilala self-assigned this May 7, 2026
Copilot AI review requested due to automatic review settings May 7, 2026 03:17
@mohittilala mohittilala requested a review from a team as a code owner May 7, 2026 03:17
@mohittilala mohittilala added the bug, Ingestion, safe to test, To release, and python labels May 7, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Fixes an ingestion bug in the Datalake connector where JSON-like columns (especially empty {} / [] values coming from single-object JSON files) were incorrectly inferred as STRING, and where parsing children could emit repeated TypeError debug logs.

Changes:

  • Update column type inference to treat non-null object columns as candidates even when values are falsy containers, and avoid unnecessary ast.literal_eval for already-parsed dict/list values.
  • Rework JSON children extraction to handle mixed parsed-dict and JSON-string values without TypeError noise.
  • Add unit + fixture-based tests covering parsed objects, empty containers, and single-object JSON ingestion behavior.
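
The second bullet can be sketched roughly as follows. This is an assumption-laden illustration, not the `get_children` implementation: `collect_child_keys` is a hypothetical helper that only gathers top-level keys. It relies on the fact that `json.loads` raises `TypeError` when handed a dict, which is one plausible source of the log noise described above, so parsed dicts and JSON strings are handled on separate branches.

```python
import json


def collect_child_keys(values):
    """Ordered union of top-level keys across dicts and JSON-object strings."""
    keys = []
    for value in values:
        if isinstance(value, dict):
            obj = value  # already parsed: use as-is, never pass to json.loads
        elif isinstance(value, str):
            try:
                obj = json.loads(value)
            except json.JSONDecodeError:
                continue  # plain text, not JSON: skip quietly
            if not isinstance(obj, dict):
                continue  # JSON scalar or array: no child keys to expand
        else:
            continue  # None, numbers, lists: nothing to expand here
        for key in obj:
            if key not in keys:
                keys.append(key)
    return keys
```

Checking the type before parsing means a mixed column never reaches the parser with a non-string value, so no exception has to be caught and logged for the common case.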

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| ingestion/src/metadata/utils/datalake/datalake_utils.py | Fixes type inference for empty dict/list values and makes get_children robust to parsed objects vs JSON strings. |
| ingestion/tests/unit/utils/test_datalake.py | Adds targeted tests for fetch_col_types/get_children and fixture-driven single-object JSON parsing. |
| ingestion/tests/unit/resources/datalake/dbt_manifest.json | Adds a representative single-object dbt manifest fixture with multiple empty-object fields. |
| ingestion/tests/unit/resources/datalake/dbt_catalog.json | Adds a representative single-object dbt catalog fixture with nested dicts and nulls. |

Comment thread ingestion/src/metadata/utils/datalake/datalake_utils.py Outdated
Comment thread ingestion/tests/unit/utils/test_datalake.py
github-actions Bot (Contributor) commented May 7, 2026

🟡 Playwright Results — all passed (21 flaky)

✅ 4064 passed · ❌ 0 failed · 🟡 21 flaky · ⏭️ 86 skipped

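| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 297 | 0 | 2 | 4 |
| 🟡 Shard 2 | 757 | 0 | 11 | 8 |
| 🟡 Shard 3 | 780 | 0 | 1 | 7 |
| 🟡 Shard 4 | 786 | 0 | 4 | 18 |
| ✅ Shard 5 | 709 | 0 | 0 | 41 |
| 🟡 Shard 6 | 735 | 0 | 3 | 8 |
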
🟡 21 flaky test(s) (passed on retry)
  • Pages/AuditLogs.spec.ts › should create audit log entry when glossary is soft deleted (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Complete Incident lifecycle with table owner (shard 2, 2 retries)
  • Features/IncidentManager.spec.ts › Next, Previous and page indicator (shard 2, 2 retries)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test add article button (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/KnowledgeCenterTextEditor.spec.ts › Rich Text Editor - Text Formatting (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Create custom property and configure search for Dashboard (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Should verify property name is visible for apiCollection in right panel (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for Pipeline (shard 4, 1 retry)
  • Pages/DataProducts.spec.ts › Search Data Products (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel.spec.ts › Should allow Data Consumer to edit glossary terms for searchIndex (shard 6, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings May 8, 2026 04:50
TeddyCr previously approved these changes May 11, 2026
Copilot AI review requested due to automatic review settings May 11, 2026 08:56
gitar-bot Bot commented May 11, 2026

Code Review ✅ Approved

Replaces lexicographic type resolution with explicit precedence in datalake utils to correctly identify JSON and array columns. Added comprehensive unit tests to ensure accurate type detection for empty containers and mixed-type inputs.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread ingestion/tests/unit/utils/test_datalake.py
@ulixius9 ulixius9 merged commit 3d6fd71 into main May 11, 2026
64 of 66 checks passed
@ulixius9 ulixius9 deleted the fix/datalake-json-column-type-detection branch May 11, 2026 12:32
@github-actions (Contributor)

Failed to cherry-pick changes to the 1.12.7 branch.
Please cherry-pick the changes manually.
You can find more details here.

@github-actions (Contributor)

Failed to cherry-pick changes to the 1.13 branch.
Please cherry-pick the changes manually.
You can find more details here.

Labels

bug, Ingestion, python, safe to test, To release

Development

Successfully merging this pull request may close these issues.

Datalake connector: JSON columns incorrectly typed as STRING and TypeError logged when ingesting single-object JSON files

4 participants